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Abstract 

We consider the problem of multivariate density estimation when the unknown 
density is assumed to follow a particular form of dimensionality reduction, a noisy 
independent factor analysis (IFA) model. In this model the data are generated by 
a number of latent independent components having unknown distributions and are 
observed in Gaussian noise. We do not assume that cither the number of components 
or the matrix mixing the components are known. We show that the densities of this 
form can be estimated with a fast rate. Using the mirror averaging aggregation 
algorithm, we construct a density estimator which achieves a nearly parametric 
rate (log^^^ n) / \/n^ independent of the dimensionality of the data, as the sample 
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size n tends to infinity. This estimator is adaptive to the number of components, 
their distributions and the mixing matrix. We then apply this density estimator 
to construct nonparametric plug-in classifiers and show that they achieve the best 
obtainable rate of the excess Bayes risk, to within a logarithmic factor independent 
of the dimension of the data. Applications of this classifier to simulated data sets 
and to real data from a remote sensing experiment show promising results. 

Key words: Nonparametric Density Estimation; Independent Factor Analysis; 
Aggregation; Plug-in classifier; Remote sensing. 

1 Introduction 

Complex data sets lying in multidimensional spaces are a commonplace occurrence in 
many areas of science and engineering. There are various sources of this kind of data, 
including biology (genetic networks, gene expression microarrays, molecular imaging 
data), communications (internet data, cell phone networks), risk management, and many 
others. One of the important challenges of the analysis of such data is to reduce its 
dimensionality in order to identify and visualize its structure. 

It is well known that common nonparametric density estimators are quite unreliable 
even for moderately high-dimensional data. This motivates the use of dimensionality 
reduction models. The literature on dimensionality reduction is very extensive, and we 
mention here only some recent publications that are connected to our context and contain 
further references (Roweis and Saul 2000; Tenebaum, de Silva and Langford 2000; Cook 
and Li 2002, Blanchard et al. 2006; Samarov and Tsybakov 2007). 

In this paper we consider the independent factor analysis (IFA) model, which 
generalizes the ordinary factor analysis (FA), principal component analysis (PC A), and 
independent component analysis (ICA). The IFA model was introduced by Attias (1999) 
as a method for recovering independent hidden sources from their observed mixtures. In 
the ordinary FA and PCA, the hidden sources arc assumed to be uncorrclatcd and the 
analysis is based on the covariance matrices, while IFA assumes that the hidden sources 
(factors) are independent and have unknown, non-Gaussian distributions. The ICA, in 
its standard form, assumes that the number of sources is equal to the number of observed 
variables and that the mixtures are observed without noise. Mixing of sources in realistic 
situations, however, generally involves noise and different numbers of sources (factors) 
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and observed variables, and the IFA model allows for both of these extensions of ICA. 

Most of the existing ICA algorithms concentrate on recovering the mixing matrix and 
either assume the known distribution of sources or allow for their limited, parametric 
flexibility, see Hyvarinen, Karhunen and Oja (2001). Attias (1999) and more recent 
IFA papers (An, Xu and Xu 2006; Montanari, Calo and Viroli 2008) either use mixture 
of Gaussian distributions as source models or assume that the number of independent 
sources is known, or both. In the present paper the IFA serves as a dimensionality 
reduction model for multivariate nonparametric density estimation; we suppose that the 
distribution of the sources (factors) and their number are unknown. 

Samarov and Tsybakov (2004) have shown that densities which have the 
standard, noiseless ICA representation can be estimated at an optimal one-dimensional 
nonparametric rate, without knowing the mixing matrix of the independent sources. Here 
our goal is to estimate a multivariate density in the noisy IFA model with unknown 
number of latent independent components observed in Gaussian noise. It turns out that 
the density generated by this model can be estimated with a very fast rate. In Section |2] we 
show that, using recently developed methods of aggregation (Juditsky et al. 2005, 2008), 
we can estimate the density of this form at a parametric root-n rate, up to a logarithmic 
factor independent of the dimension d. 

One of the main applications of multivariate density estimators is in the supervised 
learning. They can be used to construct plug-in classifiers by estimating the densities 
of each labeled class. Recently, Audibert and Tsybakov (2007) have shown that plug-in 
classifiers can achieve fast rates of the excess Bayes risk and under certain conditions 
perform better than classifiers based on the (penalized) empirical risk minimization. A 
difficulty with such density-based plug- in classifiers is that, even when the dimension d 
is moderately large, most density estimators have poor accuracy in the tails, i.e., in the 
region which is important for classification purposes. Amato, Antoniadis and Gregoire 
(2003) have suggested to overcome this problem using the ICA model for multivariate data. 
The resulting method appears to outperform linear, quadratic and fiexible discriminant 
analysis (Hastie, Tibshirani and Buja 1994) in the training set, but its performance 
is rather poor in the testing set. Earlier, Polzehl (1995) suggested a discrimination- 
oriented version of projection pursuit density estimation, which appears to produce quite 
good results but at a high computational cost. His procedure depends on some tuning 
steps, such as bandwidth selection, which are left open and appear to be crucial for the 
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implementation. More recently, Montanari et al. (2008) constructed plug-in classifiers 
based on the IFA model, with the sources assumed to be distributed according to a 
mixture of Gaussian distributions, and reported promising numerical results. 

In Section [3] we give a bound to the excess risk of nonparametric plug-in classifiers 
in terms of the MISE of the density estimators of each class. Combining this bound 
with the results of Section [2| we show that if the data in each class are generated by a 
noisy IFA model, the corresponding plug-in classifiers achieve, within a logarithmic factor 
independent of the dimensionality d, the best obtainable rate of the excess Bayes risk. In 
Section |4] we describe the algorithm implementing our classifier. Section [5] reports results 
of the application of the algorithm to simulated and real data. 

2 Independent factor analysis model for density 
estimation 

We consider the noisy IFA model: 

X = AS + e, (1) 

where A is a c? x m unknown deterministic matrix of factor loadings with unknown 
m < (i, S is an unobserved m-dimensional random vector with independent zero-mean 
components (called factors) having unknown distributions each admitting a density and a 
finite variance, and e is a random vector of noise, independent of S, which we will assume 
to have d-dimensional normal distribution with zero mean and covariance matrix a^Irf, 
cr^ > 0. Here 1^ denotes the dx d identity matrix. 

Assume that we have independent observations Xi, . . . ,X„, where each Xj has the 
same distribution as X. As mentioned in the Introduction, this model is an extension of 
the ICA model, which is widely used in signal processing for blind source separation. In 
the signal processing literature the components of S are called sources rather than factors. 
The basic ICA model assumes e = and m = d (cf., e.g., Hyvarinen et al. 2001). Unlike 
in the signal processing literature, our goal here is to estimate the target density px(-) 
of X, and model ([T]) serves as a particular form of dimensionality reduction for density 
estimation. 

Somewhat different versions of this model where the signal S has not necessarily 
independent components and needs to be non-Gaussian were considered recently by 
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Blanchard et al. (2006), Samarov and Tsybakov (2007). Blanchard et al. (2006) and 
the follow-up paper by Kawanabe et al. (2007) use projection pursuit type techniques to 
identify the non-Gaussian subspace spanned by the columns of A with known number of 
columns m, while Samarov and Tsybakov (2007) propose aggregation methods to estimate 
the density of X when neither the non-Gaussian subspace, nor its dimension are known. 

It is well known that the standard, covariance-based factor analysis model is not fully 
identifiable without extra assumptions (see, e.g., Anderson and Rubin 1956). Indeed, the 
factors are defined only up to an arbitrary rotation. The independence of factors assumed 
in ([T]) excludes this indeterminacy provided that at most one factor is allowed to have a 
Gaussian distribution. This last assumption is standard in the ICA literature and we will 
also make it throughout the paper. We will also assume throughout that the columns of 
A are orthonormal. 

By independence between the noise and the vector of factors S, the target density px 
can be written convolution: 

Px(x) = / ps{s)(f)d,^2{x- As)ds, (2) 

where 0^^0-2 denotes the density of a c?- dimensional Gaussian distribution Nd{0,a'^ld). 

Since in ^ we have a convolution with a Gaussian distribution, the density px has 
very strong smoothness properties, no matter how irregular the density ps of the factors 
is, whether or not the factors are independent, and whether or not the mixing matrix A 
is known. In the Appendix, we construct a kernel estimator p* of px such that 

Ebn-PX2<C' , 3 

n 

where C is a constant and || ■ II2 is the L2(M'^) norm. As in Artiles (2001), Belitser and 
Levit (2001), it is not hard to show that the rate given in (|3| is optimal for the class of 
densities px defined by ^ with arbitrary ps- 

Though this rate appears to be very fast asymptotically, it does not guarantee good 
accuracy for most practical values of n, even if d is moderately large. For example, 
if = 10, we have (logn)*^/^ > n for all n < 10^. As we show below, the assumed 
independence of the sources and orthogonality of A allows us to eliminate the dependence 
of the rate on the dimension d. 

In order to construct our estimator, we first consider the estimation of px when the 
dimension m, the mixing matrix A, and the level of noise are specified; the fact that 
none of these quantities is known is addressed later in this section. 
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Since the columns of A are orthonormal, we have A^X = S + A^e and 
^A„2(x-/ls) = (^^^ exp|-^(x-Asf(x-^s)| 

Substitution of the above expression in (|2| gives: 

^^^^^ " V 2^ ) {~2^^^^^'^ ~ AA^)^[ J ps(s)</'™,^2(s - A^^)ds. 

Now, by independence of the factors, we get: 

Px(x) = Pm,A{^) = j {-^''^(^'^ - AA^)x| n (7fc(a^x) (4) 



where denotes the fcth column of A and 

9k{u) = {ps^ * (pi,a^){u) = I - s)ds. (5) 



We see that to estimate the target density px it suffices to estimate nonparametrically 
each one-dimensional density Qk using the projections of an observed sample Xi, . . . ,X„ 
generated by the model ([T| onto the A;th direction a^. 

Note that, similarly to ([2]), the density gk is obtained from convolution with a one- 
dimensional Gaussian density, and therefore has very strong smoothness properties. To 
estimate gk we will use the kernel estimators 



i=l 



with a bandwidth /i„ x (logn)~^/^ and the sine function kernel K(u) = smu/jru. We 
could also use here any other kernel K whose Fourier transform is bounded and compactly 
supported, for example, the de la Vallee-Poussin kernel K{u) = (cos(u) — cos(2M))/(7rM^), 
which is absolutely integrable and therefore well suited for studying the Li-error. 

A potential problem of negative values of g^ in the regions where the data are 
sparse can be corrected using several methods (see, for example. Hall and Murison 1993; 
Glad, Hjort and Ushakov 2003). For our practical implementation we will follow the 
method suggested in Hall and Murison (1993), and our estimators will be obtained by 
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truncating the estimator gk{x) outside the "central" range where it is nonnegative, and 
then renormahzing. 

Once each "projection" density gk is estimated by the corresponding kernel estimator 
(|6|, the full target density px is then estimated using Q: 

Pn,n.A^) = [■^j 1"^''^^^'^ ~ AA^)^j J]^fc(a^x). (7) 

The following proposition proved in the Appendix summarizes the discussion for the 
case when A and cr^ are known. 

Proposition 1 Consider a random sample of size n from the density px given by with 
known A and a^. Then the estimator ^ with g^ given in ^ has the mean integrated 
square error of the order {lognY^'^ /n: 

\ n 

Note that neither m nor d affect the rate. Note also that Proposition [T] is valid with 
no assumption on the distribution of the factors. The identifiability assumption (that at 
most one factor is allowed to have a Gaussian distribution) is not used in the proof, since 
we do not estimate the matrix A. 

So far in this section we have assumed that A and cr^ are known. When o"^ is an 
unknown parameter, it is still possible to obtain the same rates based on the approach 
outlined above, provided that the dimensionality reduction holds in the strict sense, i.e., 
m < d. Indeed, assume that we know an upper bound M for the number of factors m 
and that M < d. For example, if the dimensionality reduction in the strict sense holds, 
we can take M = d — 1. The assumption M < d is only needed to estimate the variance 
of the noise; if cx^ is known we allow M = d. 

The assumed independence and finite variance of the factors imply that their 
covariance matrix, which we will denote by W, is diagonal. The covariance matrix Sx of 
X is given by: 

Sx = AWA^ + a^d- 

If Ai(Sx) > ■ ■ ■ > Arf(Ex) denote the eigenvalues of Sx sorted in decreasing order, then 
Aj(Sx) = Wi + 0"^, for i = 1, . . . , m, and Ai(Sx) = cr^ for i > m, where Wi denote the 
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diagonal elements of W. We estimate a' 



^ with 



1 



d 



d-M 



i=M+l 



where Aj, i 



1, . . . , (i, are the eigenvalues of the sample covariance matrix Ex arranged 



in decreasing order. Note that o"^ is a root-n consistent estimator. Indeed, the root-n 



where C and D are any symmetric matrices and ||-D||2 is the spectral norm of D. The 
last inequality easily follows from a classical inequality of Fan (1951). 

Using the root-n consistency of o"^, it is not hard to show that the estimation of 
does not affect a slower density estimator rate, and so in what follows we will assume that 
0"^ is known. 

Consider now the case where the index matrix A, and hence its rank m, are unknown. 
We will use a model selection type aggregation procedure similar to the one developed 
recently by Samarov and Tsybakov (2007) and, more specifically, the mirror averaging 
algorithm of Juditsky, RigoUet and Tsybakov (2008). We aggregate estimators of the 
typs ^ corresponding to candidate pairs {k.Bk)-, k = 1,...,M. Here is a d x k 
matrix whose columns are the first k (in the decreasing order of eigenvalues) orthonormal 
eigenvectors of the spectral decomposition of Sx — o'^Id (and thus of Sx)- For the true 
rank m, it follows from Lemma A.l of Kneip and Utikal (2001) that, provided that m 
largest eigenvalues of Sx — cr^I^ are distinct and positive and the 4th moments of the 
components of X are finite, Bm is a i/n-consistent estimator of A. 

We can now define the aggregate estimator, applying the results of Juditsky, Rigollet 
and Tsybakov (2008) in our framework. We split the sample Xi, . . . , X„ in two parts, 
Vi and V2 with rii = Card(Pi), 77-2 = Card('E'2); n = rii + n2. From the first subsample 
Vi we construct the estimators 



for /c = 1, . . . , M, where hkj denotes the jth column of B^, the estimators gj{-) are defined 
in (pj), and both Bk and gj{-) are based only on the first subsample Pi. 





(9) 
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The collection C of density estimators fc = l,...,M| of the form (9) 

constructed from the subsample T)\ can be considered as a collection of fixed functions 
when referring to the second subsample T>2- The cardinality of this collection is M. 

To proceed further, we need some more notation. Let G be the simplex 



M 



= <^ G M^^ : ^fc = 1, ^fc > 0, A; = 1, . . . , M L 



fe=i 



and 



where 



u(X) = (mi(X),...,mm(X))^, 

Ufc(x) = y p2(x)dx-2pfc(x). (10) 



Introduce the vector function 

H(x) = (pi(x),...,pm(x))^. 

As in Juditsky, Rigollet and Tsybakov (2008), the goal of aggregation is to construct a 
new density estimator p„(x) of the form 

p„(x) = ^^H(x) (11) 

which is nearly as good in terms of the L2-risk as the best one in the collection C Using the 
mirror averaging algorithm, the aggregate weights Q are computed by a simple procedure 
which is recursive over t 
are defined in the form: 



which is recursive over the data. Starting with an arbitrary value Q^^^ E 0, these weights 



^ £=1 



where the components oi 6^^ are given by 

exp (-/3-i^J^iMfe(X^) 

dl - k = l,...,M, (13) 

ESiexp (-/?-! E:=i«t(X.)) 

with Xr, r = 1, . . . , 712, denoting the elements of the second subsample V2. Here /3 > is 
a random variable measurable w.r.t. the first subsample Vi. 

Our main result about the convergence of the aggregated density estimator is given 
in Theorem [l] below. We will consider the norms restricted to a Euclidean ball B C M*^: 
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\Ib = /b/'Wc?x, \\f\\oo,B = supigs \f{t)\ for / : M'^ ^ M. Accordingly, in Theorem 
[l] we will restrict our estimators to B and define pn by the above aggregation procedure 
where pk{x) are replaced by pk{x)I{x G B}. Here /{■} denotes the indicator function. 

Clearly, all densities px of the form Q are bounded: ||px||oo,B < Lq := (27rcr^)"'^/^ for 
all m and A. We set Li = maxfc=i_. ,_jvf ||Pfc||oo,s and L = max(Lo,I/i). In the Appendix 
we prove that 

mPk\\oo,B < L' , \/k = l,...,M, (14) 

where L' is a constant. 

Theorem 1 Lei px be the density of X in model Assume that covariance matrix 
Sx has distinct eigenvalues and the 4th moments of the components of X are finite. Let 
n2 = [cn/y/logn\ for some constant c > such that 1 < n2 < n. Then for f3 = 12L, the 
aggregate estimator pn with 6 obtained by the mirror averaging algorithm restricted to a 
Euclidean ball B satisfies 

nPn- Prills = 0(^^^^^y (15) 

as n ^ +00. 



The theorem implies that the estimator p„ adapts to the unknown m and A, i.e., has 
the same rate, independent of m and (i, as in the case when the dimension m and the 
matrix A are known. The proof is given in the Appendix. 

Remarks. 

1. Inspection of the proof shows that Theorem [T] holds with no assumption on 
distributions of the factors (except that at most one of them can be Gaussian). In 
particular, we do not need them to have densities with respect to the Lebesgue measure. 

2. We state Theorem [l] with a restricted L2-norm || ■ \\2,b- Under mild assumptions 
on the densities of the factors we can extend it to the L2-norm on W^. Indeed, inspection 
of the proof shows that Theorem [T] remains valid for balls B of radius which tends 
to infinity slowly enough as n — ^ oo. If px behaves itself far from the origin roughly as 
a Gaussian density (which is true under mild assumptions on factor densities), then the 
integral of p\ outside of the ball reduces to a value smaller than the right hand side of 
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3 Application to nonparametric classification 



One of the main applications of multivariate density estimators is in the supervised 
learning, where they can be used to construct plug-in classifiers by estimating the 
densities of each labeled class. The difficulty with such density-based plug-in classifiers 
is that, even for moderately large dimensions d, standard density estimators have poor 
accuracy in the tails, i.e., in the region which is important for classification purposes. In 
this section we consider the nonparametric classification problem and bound the excess 
misclassification error of a plug-in classifier in terms of the MISE of class-conditional 
density estimators. This bound implies that, for the class-conditional densities obeying 
the noisy IFA model ([2]), the resulting plug-in classifier has nearly optimal excess error. 

Assume that we have J independent training samples {Xji, . . . , Xj^.} of sizes Nj, 
j = 1, . . . , J, from J populations with densities /i, . . . , /j on M''. We will denote by V the 
union of training samples. Assume that we also have an observation X G M'' independent 
of these samples and distributed according to one of the /-,. The classification problem 
consists in predicting the corresponding value of the class label j G {1, . . . , J}. We define a 
classifier or prediction rule as a measurable function T(-) which assigns a class membership 
based on the explanatory variable, i.e., T : M"^ — {1, . . . , J}. The misclassification error 
associated with a classifier T is usually defined as 



where ¥j denotes the class-conditional population probability distribution with density 
fj, and TTj is the prior probability of class j. We will consider a slightly more general 
definition: 



where i? is a Borel subset of M . The Bayes classifier T* is the one with the smallest 
misclassification error: 



In general, the Bayes classifier is not unique. It is easy to see that there exists a Bayes 
classifier T* which does not depend on B and which is defined by 



J J 






Rb{T*)= mm Rb{T). 



7rT*(x)/T*(x)(x) 



l<j<J 



mm 7rj/j(x), V x G M"'. 
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A classifier trained on the sample T> will be denoted by Tx)(x). A key characteristic of 
such a classifier is the misclassification error Rb{Tx>)- One of the main goals in statistical 
learning is to construct a classifier with the smallest possible excess risk 

S{Tv) =¥.RBiTv) - Rb{T*). 

We consider plug-in classifiers T(x) = Tx)(x) defined by: 

(x) /f (x) (x) = mm^ -Kj fj (x) , V X G M'^ 

where fj is an estimator of density fj based on the training sample {Xji, . . . ,Xjjv^,}. 

The following proposition relates the excess risk S{T) of plug-in classifiers to the rate 
of convergence of the estimators fj. 



Proposition 2 



S{f)<J2^,E [ |/,(x)-/,(x)|rfx 

7 = 1 



Proof of the proposition is given in the Appendix. 

Assume now that the class-conditional densities follow the noisy IFA model Q with 
different unknown mixing matrices and that Nj x n for all j. Let B he a Euclidean ball 
in M.^ and define each of the estimators fj using the mirror averaging procedure as in the 
previous section. Then, using Theorem [1} we have 

|/,(x) - /,(x)|rfx < vTbI - f,hB = O (^^^^^) 



as n — i> oo, where \B\ denotes the volume of the ball B. Thus, the excess risk S{T) 
converges to at the rate {lognY^'^/ y/n independently of the dimension d. Following the 
argument in Devroye, Gyorfi and Lugosi (1996) or Yang (1999), it is easy to show that 
this is the best obtainable rate for the excess risk, up to the log^^"^ n factor. 



4 The algorithm 

In this section we discuss numerical aspects of the proposed density estimator. 

Clearly, one-dimensional kernel density estimators gk with given bandwidth, say 
hn oc (logra)"^/^, can be computed in a fast way. Similarly, estimating the variance 



12 



of the noise component in the noisy IFA model amounts to implementing a single singular 
value decomposition (SVD) of the dx n data matrix D = (Xi, . . . , X^). Let D = VAU'^ 
be the SVD of D, where A is the diagonal matrix and U, V are matrices with orthonormal 
columns. We assume w.l.o.g. that Xj are centered. Then an estimate of the variance al 
with rank k approximation, A; < M, is given by 

1 

^'=rf3lE^'' k = l,...,M (16) 

i=k+l 

where Sj are the diagonal elements of A/^/n sorted in the decreasing order. When the index 
matrix A is unknown, the rank k approximation Bk of A used in the density estimator 
Pk, cf. can be easily obtained from the SVD of D. Indeed, we can take = V^, 
where Vk is formed by the first k columns of V. So, accurate computation of the density 
estimators ^ is feasible, reasonably fast and does not require a huge amount of memory 
even for very large n and d. 

Therefore, the complexity of the procedure is controlled by the numerical 
implementation of the mirror averaging algorithm which, in particular, requires the 



computation of the score functions Uk{'x), involving integration of pl, see (10). The 
numerical implementation of the integral of the square of density estimates pk in M*^ can 
be realized by means of cubature formulas. Recall that for the calculation of J pfe(x)^(ix, 
say, a cubature has the form Xlili "^iPfcl^*) "where Xj are the nodes and Wi are the 
associated weights. In our setting, M integrals involving the i?fc-projections need to be 
calculated for each 6k, so formulas with fixed nodes will be actually more economical. On 
multidimensional domains, product quadratures quickly become prohibitive (they grow 
exponentially in d for the same accuracy), and therefore this approach is not realistic. 

An alternative is to use Monte-Carlo integration methods which require much more 
evaluations but do not depend on the dimension d, or a more clever implementation 
through Gibbs sampling by generating samples from some suitable distribution for the 
Monte-Carlo estimates. Several Gibbs sampling strategies were considered in the present 
work. The fastest one was to generate samples directly from pk, so that 



f 1 ^ 



where Q is the number of generated i.i.d. random realizations x, from the density pk- 
The overall algorithm implementing our approach is the following: 
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Algorithm 1 



- Compute the singular value decomposition of the data array D: 



D = VAU^, 

with matrices U, V, and A having dimensions n x d, d x d and d x d, respectively; 



for k=l,. . .,M 



Take Bj. as the matrix built from the first k columns of V; 



Compute a1 from (16) 



Compute the density estimator Pfc(x) from ^ based on the subsample Vi ; 



Compute Uki'x.) from (10) 
end for 



Estimate the weights through (12)-(13) and output the final density estimator (11). 



To speed up computations, one-dimensional kernel density estimators gj, j = 1, . . . , M, 
in ^ are obtained through a Fast Fourier Transform algorithm, cf. Silverman (1982). 
The algorithm for estimating J pl{x.)d:si in (10) goes through the following steps. 

Algorithm 2 - Generate Q independent random numbers, y^j!\ i = 1, . . . ,Q, from each 
Qk, k = 1, . . . , M, and compute the corresponding density Qkiyf') by kernel density 
estimation; 

- Generate the corresponding ci-dimensional x*-*^ as x*-*^ = -B^y*-*^ + {Id — BkB'J:)e^^\ 

y('') = {y^\ . . . ,yf'), with e*^*^ being random numbers extracted from a ci-variate 
Gaussian density function having mean and diagonal covariance (5"^/^; 

- Compute pfc(x'^*)) through (fo 



- Output the estimate ^ Yl%iPk{^^^^) of the integral / pl{x)dx. 

Here Q is chosen so that generating more random numbers does not change the estimated 
value of the integral within a predefined tolerance. Random numbers generated from 
the density estimator are based on the corresponding cumulative functions and pre- 
computed on a high resolution grid with linear interpolation. 
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5 Simulations and examples 



5.1 Density estimation 

To study the performance of density estimates based on our noisy IFA model we have 
conducted an extensive set of simulations. We used data generated from a variety of 
source distributions, including subgaussian and supergaussian distributions, as well as 
distributions that are nearly Gaussian. We studied unimodal, multimodal, symmetric, 
and nonsymmetric distributions. Table [T] lists the basic (one-dimensional) test densities 
from which multidimensional density functions are built. 

Experiments were run up to dimension d = 6 with a number of independent factors 
equal to 1 and 2. Random i.i.d. noise was generated and added to the simulated signals 
so that the Signal to Noise Ratio (SNR) was equal to 3, 5 or 7. The kernels K for 
density estimators gj in ^ were the Gaussian, the sine and de la Vallee-Poussin kernels; 
the bandwidth h was chosen as h = a / \og^^'^ n. To obtain legitimate (i.e., nonnegative) 
density functions they were post-processed by the procedure of Hall and Murison (1993). 
The size of the sample was chosen as n=200, 300, 500, 700, 1000, 2000 and 4000. The 
following criterion was used for evaluating the performance of density estimators: 

r _inn/^i / (Pcstimatcd(x) - Px{^)f dx \ 

The performance of IFA density estimation was compared with kernel smoothing (KS) 
(see, e.g.. Wand and Jones, 1995) as implemented in the KS package available in R. IFA 
density estimation has been implemented in the MATLAB environment and the scripts 
are available upon request. We note that KS can be effectively computed only up to 
d = 6 if the FFT algorithm is used. In contrast with this, our method has no practical 
restrictions on the dimension. This is due to the use of a proper Gibbs sampling for 



estimating integrals (10); in addition the density estimate can be computed on any set in 
W^, not necessarily on a lattice imposed by the FFT. 

We conducted numerical experiments by generating random samples of size n from the 
independent components of Table [T| random mixing matrices, and different realizations 
of Gaussian noise. In particular, the elements of the mixing matrix A were generated 
as i.i.d. standard Gaussian random variables and then the matrix was orthonormalized 
by a Gram-Schmidt procedure. We perform 50 Monte-Carlo replications for each case 
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and output the corresponding values /i. Results over all experiments show a very good 
performance of Noisy IFA. For brevity we only show some representative figures in the 
form of boxplots. We display different test functions to demonstrate good performances 
over all of them. Moreover, we present only the case of SNR=3 because it seems to be more 
interesting for applications and because improvement of performance for both methods 
flattens the differences. Figure [l] shows the case of ci = 2, SNR=3 and test function 2 (chi- 
square function), where the superiority of the aggregated Noisy IFA with respect to KS is 
clear. Figure [2] shows analogous boxplots in the case d = 3 and test function 3 (mixture of 
Gaussians), again when SNR=3. This case is interesting because the dimension d is larger, 
whereas the number of independent factors is kept constant with respect to the previous 
experiment. Figure [2] clearly shows that difference of performance between Noisy IFA and 
KS increases in favor of the former. Finally, Figure |3] shows boxplots in the case d = 5 and 
test functions 5 and 6 (chi-square and Student, respectively), again for SNR=3. Better 
performance of Noisy IFA with respect to KS is conflrmed, especially when d increases. 

Finally, Table |2] shows typical computational times of aggregated IFA and KS density 
estimators. Executions were run on a single core 64-bit Opteron 248 processor with 
MATLAB version R2008a, R 2.9.0 and Linux Operating System. We see that the 
aggregated IFA is more than one order of magnitude faster than KS. 

5.2 Classification: a real data example 

In this subsection we apply the nonparametric classification method suggested in Section [3] 
to real data. We consider only a two-class problem and we assume that the class- 
conditional distributions follow the noisy IFA model. To evaluate the performance of 
our approach in comparison with other classification methods that are often used in this 
context, we have also applied to these data three other classification procedures, one 
parametric and two nonparametric, namely: 

LDA (Linear Discriminant Analysis). Class-conditional density functions are supposed 
to be Gaussian with a common covariance matrix among classes, and the two classes 
are separated by a hyperplane in ci-dimensional space. 

NPDA (Nonparametric Discriminant Analysis, Amato et al. 2003). In this procedure 
class-conditional density functions are estimated nonparametrically by the kernel 
method, assuming that the density obeys an ICA model. The kernel functions 
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mentioned above in this section were considered in the experiments. The smoothing 
procedure uses an asymptotic estimate of the bandwidth and a correction for getting 
non-negative density estimators. 

FDA (Flexible Discriminant Analysis; Hastie, Tibshirani and Buja 1994). This method is 
also nonparametric, but classification is performed through an equivalent regression 
problem where the regression function is estimated by the spline method. 

We have compared the performance of the classification methods on a data set 
from a remote sensing experiment. MSG (METEOSAT Second Generation) is a series 
of geostationary satellites launched by EUMETSAT (EUropean organization for the 
exploitation of METeorological SATellites) mainly aimed at providing data useful for the 
weather forecast. A primary instrument onboard MSG is SEVIRI, a radiometer measuring 
radiance emitted by Earth at d = 11 spectral channels having a resolution of 3 Km^ at sub- 
satellite point. Essentially, SEVIRI produces 11 images of the whole Earth hemisphere 
centered at 0° degrees latitude every 15 minutes. Recognizing whether each pixel of the 
images is clear or affected by clouds (cloud detection) is a mandatory preliminary task for 
any processing of satellite data. In this respect multispectral radiance data are prone to 
improve the detectability of clouds, thanks to the peculiar behavior of clouds in selected 
spectral bands. Figure |4] shows an RGB image of the Earth taken by SEVIRI on June 
30th 2006 UTC time 11:12 composed by 3 selected spectral channels. The problem of 
cloud detection is to infer the possible presence of clouds for each pixel of the images. In 
order to accomplish this task by discriminant analysis a training set has to be defined. 
Here we take the training set from a cloud mask produced by sensor MODIS onboard 
NOAA EOS series satellites. MODIS sensor is endowed with a product (MOD35) aimed 
to produce a reliable cloud mask in many pixels (confident classification in the terminology 
of MOD35). The algorithm underlying MOD35 is based on physical arguments, with a 
series of simple threshold tests mostly based on couples of spectral bands (see Platnick et 
al. (2003) for details of the algorithm). Troubles in dealing with the increasing number 
of spectral bands of current and next generation instrumentation from the physical point 
of view is fostering investigation of statistical methods for detecting clouds. Due to the 
very different spectral characteristics of water and land pixels, two separate independent 
classifications are performed for the two cases. Over land the MOD35 data set is composed 
of 11289 cloudy pixels and 19022 clear ones; for water pixels we have 14585 cloudy pixels 
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and 16619 clear ones. We assume that labels assigned by MOD35 are the truth. 

In order to evaluate the methods, for each case (land and water) we divide the data 
set randomly into two parts; a training set of about 2/3 of the pixels used for estimation 
and learning (training set) and a test set of about 1/3 of the pixels used for evaluation of 
the prediction capability of the estimated discrimination. The split was done 50 times in 
such a way that the proportion of clear and cloudy pixels of the whole original data set 
was respected. The results are summarized as boxplots in the following figure. 

Figure |5] shows the boxplots of misclassification errors for the various classification 
methods over 50 random splits for land (left) and sea (right). For the land pixels, apart 
the NPDA method which has a poor behavior, none of the other three methods clearly 
stands out and they all perform essentially well. For the sea panels (cf. the right panel of 
Figure |5| we get different conclusions. Here the boxplots clearly indicate that our noisy 
IFA classification method has the smallest error. Finally, Figure [6] shows the cloud mask 
overimposed to the analyzed area. 

6 Conclusions 

We have considered multivariate density estimation with dimensionality reduction 
expressed in terms of noisy independent factor analysis (IFA) model. In this model 
the data are generated by a (small) number of latent independent components having 
unknown non-Gaussian distributions and observed in Gaussian noise. 

Without assuming that either the number of components or the mixing matrix are 
known, we have shown that the densities of this form can be estimated with a fast rate. 
Using the mirror averaging aggregation algorithm, we constructed a density estimator 
which achieves a nearly parametric rate log^''^ n/ i/ra, independent of the dimension of the 
data. 

We then applied these density estimates to construct nonparametric plug-in classifiers 
and have shown that they achieve, within a logarithmic factor independent of ci, the best 
obtainable rate of the excess Bayes risk. 

These theoretical results were supported by numerical simulations and by an 
application to a complex data set from a remote sensing experiment in which our IFA 
classifier outperformed several commonly used classification methods. Implementation of 
the IFA-based density estimator and of the related classifier is computationally intensive; 
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therefore an efficient computational algorithm has been developed that makes mirror 
averaging aggregation feasible from computational point of view. 



APPENDIX: PROOFS 



Proof of Note that (2| implies that the Fourier transform 9?x(u) = JjgdPx(x)e*''^"(ix 
of the density px satisfies the inequality 

Iv'xlu)! <e-'^'""ll'/2 (A.l) 

for all u G M"', where || ■ || denotes the Euclidean norm in M"'. Define the kernel estimator 

'X,; - x' 



h 



I, such that K{-x) = 11^=1 -^0(2^?=), = {xi,X2, ...,Xd), where 
for X 0, and K{0) = l/vr, with the Fourier transform 



sin X 

TTX 



with the kernel K : R'^ ^ 1 
Kq is the sine kernel: -^'0(2;) 
= /(|t| < 1). 

Using Plancherel theorem and Theorem 1.4 on p. 21 of Tsybakov (2009), we have 

1 



eiip:-px| 



< 



(27r)^ 
1 



(2 



7r 



l-$^(/iu)n(^x(u)|'rfu+- / |$^(/iu)|'(iu 



n 



where (pn{^ = X]j=i ^ " is the empirical characteristic function and $ (v) is the 
Fourier transform of K. Note that ^^(y) = YYj=i -^{l^jl — 1} where Vj are the components 
of V G M*^. Now, for the bias term we have, using (A.l ), 

||l-<|.^(/iu)|Vx(u)|^du = y"/|3j:|M,|>^||^x(u)pdu 

,/.{B.:,,,>i}.-VV-.,„ 



< e 



d/2 



Next, the variance term 



i/i*-(Mp.u4n//{Ki<i} 



n 
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Combining the last two expressions, we get 



E||p:-Px||^<C e 



+ 



with some constant C > 0. Taking here h = cr(41og?T,)~^/^, we get ([3|. □ 
Proof of Proposition [1} W.Lo.g. we will suppose here that are the canonical basis 
vectors in M"'. Note first that the proof of ([3| with d = 1 implies that the estimators ([6 
achieve the convergence rate of (logn)^/^/n for the quadratic risk: 



E\\gk - gkWl = O {{log ny/yn) Wk = l,...,m. (A.2) 
Denoting C > a constant, not always the same, we have for the estimator ([t]) 



^,A-Px\\l < CE 
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\9k -9k\\l JJ ll^il 
j=fc+i 
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- '^^f^\\\9k - 9k\\l n Il^ill2]> 



j=fc+i 

where HiL; — ^ when / > u and we have used that the L2-norms of gj are bounded for 
all j = 1, . . . ,m. The latter is due to the fact that, by Young's inequality (see, e.g., Besov 
et al., 1979), \\gj\\2 < H^i.^Hb / Ps^ = ll^i.aHh- 

We now evaluate the L2-norms of gj. By separating the diagonal and off-diagonal 
terms, 



\9j\\l 



Y —Y 

h 



(A.3) 



with the convolution kernel K* = Kq * Kq and we write for brevity Yi = aJXj. The 
second term in ( A.3[ ) is a fZ-statistic that we will further denote by Un- Since all the 
summands j^K* (^^^^) in f/„ are uniformly < C/h, by Hoeffding inequality for U- 
statistics (Hoeffding 1963) we get 



2.2^ 



P{\Un - E{Un)\ >t) < 2exp{-cnh^t 



(A.4) 
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for some constant c > independent of n. On the other hand, it is straightforward to see 

(A.5) 



that there exists a constant Co such that \E{Un)\ < Cq. This and (A.4) imply: 

P{\Un\ > 2Co) < 2exp{-cnh^) 



for some constant c' > independent of n. From ( |A.3 ) and (A.5) we get 

P{A) < 2rfexp(-cW), (A.6) 

for the random event A = {3j : \\gj\\l > Ci}, where Ci = 2Cq + / Kl/{nh). 

Using (A.6), ( |A.2[ ) and the fact that \\gj\W and ||5'j||2 are uniformly <C/h we find 



E 



\\9k~gk\\i n 

j=k+l 



< E \\gk-gk\\l n ll^jll2^M} 
j=k+i 

+ {C,r-'¥. [\\g,-g,\\ll{A'^}] 

< {C/hr-''+'P{A} + C(log ny/yn 

< Ch-^'^-''+^^ exp(-c'n/i2) + C(log nf'^/n 

< C{logny/yn. 



Thus, the proposition follows. 



□ 



Proof of (14). We will show first that for some constant C > and for all j = 1, M 

P(ll^.lloo,[-i,i]>C)<^^4^, (A.7) 

where ||/||oo,[-i,i] = sup(g[_i_]^] 1/(^)1 for / : M — M. Note that the sine kernel Kq satisfies 
the inequality |-K'o('u)| < l/vr for all -u G M. Now because 

||^j||oO,[-l,l] < E||^j||oo,[~l,l] + ll^i - E^j'||oo,[-l,l] 



and 



we have 



|E^,(t)| 



Ko{u)gj{t — uh)du 



<-, Vt G 

7T 



P(||^,||oo,hi,i] > C) < P (^\gj - E^,|U,[-i,i] 
Now for ri{t) := gj{t) — E^j(t) we have 



> C 



TT 



(A.8) 
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t + A- z' 
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t - z 



g,{z)dz (A.9) 
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for t, A G [—1, 1], where we used that \Kq{u)\ < Cq with some constant Cq for all m G M. 
Also, the standard bound for the variance of kernel estimator gj gives 



9l 



VtG [-1,1] 



(A.IO) 



with C2 = J KQ{u)du. Now (A. 9) and (A.IO) verify conditions of the following lemma. 



Lemma 1 (Ibragimov and Has'minskii 1982, Appendix 1) Let rjit) be a continuous real- 
valued random function defined on M'^ such that, for some < H < 00 and d < a < 00 
we have 

E\r]{t + A) - r/(t)|" < Af , V t, A G 
E\rj{tT<H, VtGM'^. 

Then for every 6 > and to G M'^ such that \\to\\ < D, 



E 



sup \ri{t) - ri{to)\ 

t:||f-io||<(5 



where Bq is a finite constant depending only on a and d. 



Applying this lemma with d = 1, a = 2, H = -A, to = 0, and 5 = 1, we get 



^1/2 ^ 

L-2 ^ O4 



Applying now in (A. 8) Markov inequality and choosing C = C4 + l/vr, we obtain (A. 7) 
Next, assume w.l.o.g. that B is the unit ball in W^. We note that (A. 7) implies 
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Using this and definition (|9| of pk we have that 
E||pfe||oo,B < (27ra2)('^-^)/2E 
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where we also used the fact that ||^j||oo,[-i,i] < (tt/?-) ^ for all j = l,...,k. Since 
h X (logn)""^/^, we get that, for some constant L^, 

E\\M^.R<Lk, VA; = 1,...,M, 



and (14) follows with L' = max(Li, L2, Lm)- D 
Proof of Theorem [l| To prove the theorem we use Corollary 5.7 in Juditsky, RigoUet 
and Tsybakov (2008), which implies that for j3 = 12L the corresponding aggregate 
estimator pn satisfies: 

^vA-P^Wl < , min , -^'xil^ + (A.ll) 

k=l,...,M ii ! 1. 

where denotes the expectation over the second, aggregating subsample. Here p^_^ ^ 
are the estimators constructed from the first, training subsample Vi, which is supposed 
to be frozen when applying the result of Juditsky, RigoUet and Tsybakov (2008) and the 



inequality holds for any fixed training subsample. Taking expectation in inequality (A.ll ) 
with respect to the training subsample, using that, by construction, »„ and p„ 1, f, vanish 
outside B, and interchanging the expectation and the minimum on the right hand side 
we get 

E||Pn-px||2,iJ < .^in ,E||Pni.fcA ~Px||2,iJ + ^? 

k=l,...,M 1, ) K ^2 

where now E is the expectation over the entire sample. 



Recalling now that M < d, n2 = [cn / i/log n] , and that E/5 < C by (14), we obtain 



,2 , C(l0gn)V2 



E\\pn - px\\2,B < E||p -px||2,B + • (A.12) 

k=l,...,M 1: : K Yl 

Now, 

min E||p < n\v^.A-V^\\lB. (A.13) 

where A = 13^ is the estimate of A with the true rank m and we set for brevity 
Pm,A = Pnum,A- ^ince px = Pm,A^ wc have 

\\p.ra,A - P^WIb < 2(||P„,i - Pm,A\\lB + \\Pm,A " Pm.Alla.e)- (A.14) 

Since rii = n(l + o(l)), by Proposition [l] we get 

E\\p^,A-Pm,A\\lB = 0{i\ognY/yn). (A.15) 
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It remains to prove that 



nP„a-P-^AlB = 0{{\ognfl^ln). 



(A.16) 



Denoting Gy^{A) = (^;;^)^°' ""^^^exp | — ^x-^(I(i — AA^)^^ and by and the columns 
of A and A, respectively, we can write (see (M) and (^1), 



< 



C|| J]^,(ajx) - J] (7,-(aJx)||2,B + C|| J]^,(ajx) - J] (7,(aJx)||2,B + 



i=i 



llCxli) n^?j(ajx) - ^.(A) J]^?,(aJx)||2,B =: /i + /2 + /a- 

i=i i=i 

As in the proof of Proposition 1 we get "Rlf = O {{log nY ^"^ /n), i = 1,2. Next, we show 
that E/| = 0{l/n). We write Jg < Jg,! + 13,2 where 

/3,1 = -Gx(A)||2,B|in^?J-(^J^)ll2,B, 
m m 



To bound these terms we will systematically use the fact that || Yl'^j^^ gj{^J^)\\2,B ^ C for 
all 1 < A; < Z < m (and the same with a^ instead of a^). This fact, the definition of G'x(-) 
and the boundedness of the Frobenius norms of A and A imply that /3.1 < C||A — y4||i?, 
where ||M||i;' denotes the Frobenius norm of matrix M. Now, E||A — = 0{l/n), 
which follows from Lemma A.l of Kneip and Utikal (2001) and the assumed moment 
condition on X. Thus, EJ|i = 0{l/n). We also get E/I2 = 0{l/n). This follows from 
the Lipschitz continuity of gj{-) and from the fact that (cf. proof of Proposition 1): 



k=l 



k-1 



ajx) 



2,B 



l^fc(afx) -^fc(a^x)||2 R 



n 9j{a^J^) 

j=k+i 
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2,B 



So, we have E/| = 0{l/n). This finishes the proof of (A.16). 
Inequalities ( |A.14| ), ( |A.15[ ), and ( |A.16D give 



E\\p.-pj,\\ls<0{{\ognY/yn 
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which together with (A. 12) and (A. 13) completes the proof. □ 



Proof of Proposition [2} For any classifier T we have 

Rb{T) - Rs{T*) = J2^J (J(T(x) ^ j) - J(T*(x) ^ j))/,(x)rfx 

,=1 Jb 



3 

J 



X^TT, / (/(T*(X) = j) - /(r(x) = j))/.(x)cix 
/ (7^T-{x)/t-(x)(x) - 7rT(x)/T(x)(x))rfx 

Jb 



Therefore, the excess risk of the plug-in classifier T can be written in the form 

S{f) = EiRsif)) - Rb{T*) 

= E {iTT'fT'i^) - T^fffi^) + '^fffi^) - T^fffi^))d^ (A.17) 
Jb 

where we omit for brevity the argument x of T* and T. Note that, by the definition of 
T, for all X G M"^ we have: 

7rT./T*(x) - TTfff{x.) + Tlfffiyi) - TTfffi^) < 7rT*/T*(x) - 7rT*/T*(x) + rrf\ff{x.) - ff{x.) 

J 



Combining the last display with (A. 17) proves the proposition. □ 
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Index 



1 
2 
3 
4 

5 
6 
7 



Test function 



6^(0,1) 

0.5^(-3,l) + 0.5^(2,1; 
0.47(5) + 0.67(13) 

X'(8) 
t(5) 

Double exponential : exp(- 



Table 1: List of basic functions considered for the numerical experiments. G{(1-iT) stands 
for Gaussian distribution with mean q and standard deviation r; x^(r) indicates chi-square 
density function with r degrees of freedom; 7(r) is Gamma distribution of parameter r; 
t{r) is Student distribution with r degrees of freedom. 



Experiment 


Aggregated IFA 


KS 


= 2, n = 500 


0.3 


3 


d = 3, n = 500 


0.9 


15 


= 5, n = 500 


4 


120 



Table 2: Computational time (sec) of aggregated IFA and KS for some test configurations. 
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Figure 1: Boxplot of the error criterion Ji (Eq. (17)) in the case d = 2, Signal to Noise 
Ratio 3 and test function 2 for several sample sizes. 



Figure 2: Boxplot of the error criterion Ii (Eq. (17)) in the case d = 3, Signal to Noise 



Ratio 3 and test function 3 for several sample sizes. 



Figure 3: Boxplot of the error criterion Ji (Eq. (17)) in the case d = 5, Signal to Noise 



Ratio 3 and test functions 5 and 6 for several sample sizes. 



Figure 4: RGB image obtained from the SEVIRI sensor onboard MSG on June 30th 2006 
UTC Time 11:12. 



Figure 5: Boxplot of the misclassifications for the considered classifiers. Results refer to 
land (left) and water (right) pixels of the remote sensing data. 



Figure 6: Cloud mask estimated over a part of the region in Fig. |4]by Noisy IFA. Black: 
area not subject to classification; dark gray: pixels over water classified as clear; light 
gray: pixels over land classified as clear; white: pixels over land or sea classified as cloudy. 
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Figure |2] 
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Figure |3] 
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Figure |4] 
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Figure |5] 
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